FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Natural language processing (NLP) applications such as named entity
recognition (NER) for low-resource corpora do not benefit from recent advances
in the development of large language models (LLMs), as there is still a need
for larger annotated datasets. This research article introduces a methodology
for generating translated versions of annotated datasets through crosslingual
annotation projection. Leveraging a language-agnostic BERT-based approach, it
is an efficient way to expand low-resource corpora with little human
effort, using only open data resources that are already available. Quantitative
and qualitative evaluations are often lacking when it comes to assessing the
quality and effectiveness of semi-automatic data generation strategies. The
evaluation of our crosslingual annotation projection approach showed both
effectiveness and high accuracy in the resulting dataset. As a practical
application of this methodology, we present the creation of the French Annotated
Resource with Semantic Information for Medical Entities Detection (FRASIMED),
an annotated corpus comprising 2'051 synthetic clinical cases in French. The
corpus is now available for researchers and practitioners to develop and refine
French NLP applications in the clinical field
(https://zenodo.org/record/8355629), making it the largest open annotated
corpus with linked medical concepts in French.
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
The French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED) contains 2'051 synthetic clinical cases in French, with 24'037 annotated entities. The dataset contains two subsets:
CANTEMIST-FR: Originally from CANTEMIST (Miranda-Escalada et al., 2020), it contains 1'301 oncological notes, with 15'978 annotations linked to an ICD-O-3.1 morphology code. Of these, 15'457 are also linked to a SNOMED-CT code.
DISTEMIST-FR: Originally from DISTEMIST's training set (Miranda-Escalada et al., 2022), it contains 750 clinical cases with 8'059 annotations, 5'132 of which are linked to a SNOMED-CT code.
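As a quick sanity check, the per-subset counts quoted above can be summed and compared with the corpus-level totals (2'051 cases, 24'037 annotated entities). The sketch below is illustrative only: the dictionary layout and key names are assumptions made for this example and do not reflect the actual file structure of the dataset.

```python
# Counts copied from the subset descriptions above.
# The dictionary layout and key names are hypothetical, chosen for illustration.
subsets = {
    "CANTEMIST-FR": {"documents": 1301, "annotations": 15978, "snomed_linked": 15457},
    "DISTEMIST-FR": {"documents": 750, "annotations": 8059, "snomed_linked": 5132},
}

# Corpus-level totals obtained by summing the two subsets.
total_documents = sum(s["documents"] for s in subsets.values())
total_annotations = sum(s["annotations"] for s in subsets.values())

# Share of each subset's annotations that carry a SNOMED-CT code.
snomed_coverage = {
    name: s["snomed_linked"] / s["annotations"] for name, s in subsets.items()
}

print(f"documents: {total_documents}, annotations: {total_annotations}")
for name, share in snomed_coverage.items():
    print(f"{name}: {share:.1%} of annotations linked to SNOMED-CT")
```

The sums come to 2051 documents and 24037 annotations, matching the totals stated for FRASIMED as a whole.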
Please cite us:
Zaghir, J., Bjelogrlic, M., Goldman, J.-P., Aananou, S., Gaudet-Blavignac, & Lovis, C. (2023). FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection. arXiv preprint http://arxiv.org/abs/2309.1077